MGCNSS: miRNA–disease association prediction with multi-layer graph convolution and distance-based negative sample selection strategy

Abstract
Identifying disease-associated microRNAs (miRNAs) helps to understand the underlying mechanisms of diseases, which in turn promotes the development of new medicines. Recently, network-based approaches have been widely proposed for inferring potential associations between miRNAs and diseases. However, these approaches ignore the importance of different relations in meta-paths when learning the embeddings of miRNAs and diseases. Besides, they pay little attention to screening out reliable negative samples, which is crucial for improving prediction accuracy. In this study, we propose a novel approach named MGCNSS that combines multi-layer graph convolution with a high-quality negative sample selection strategy. Specifically, MGCNSS first constructs a comprehensive heterogeneous network by integrating miRNA and disease similarity networks with their known association relationships. Then, we employ multi-layer graph convolution to automatically capture meta-path relations with different lengths in the heterogeneous network and learn discriminative representations of miRNAs and diseases. After that, MGCNSS establishes a highly reliable negative sample set from the unlabeled sample set with the distance-based negative sample selection strategy. Finally, we train MGCNSS in an unsupervised learning manner and predict the potential associations between miRNAs and diseases. The experimental results demonstrate that MGCNSS outperforms all baseline methods on both balanced and imbalanced datasets. More importantly, we conduct case studies on colon neoplasms and esophageal neoplasms, further confirming the ability of MGCNSS to detect potential candidate miRNAs. The source code is publicly available on GitHub: https://github.com/15136943622/MGCNSS/tree/master


INTRODUCTION
MicroRNAs (miRNAs) are a type of non-coding single-stranded RNA, usually about 22 nucleotides long [1]. Research has demonstrated that many miRNAs participate in the post-transcriptional regulation of gene expression in both animals and plants [2]. They can also serve as potential diagnostic markers and therapeutic targets [3]. These miRNAs regulate about one-third of human genes that are highly associated with complex human diseases. For example, miR-17, miR-92a and miR-31 have been validated as biomarkers for colorectal cancer, which is crucial for its diagnosis and treatment [4]. Thus, identifying the associations between miRNAs and diseases can promote research on disease mechanisms and support the treatment of diseases.
Identifying miRNA-disease associations through wet-lab experiments is time-consuming, labor-intensive and inefficient [5]. More recently, the development of biotechnology has promoted the emergence of computational methods, which can greatly improve prediction efficiency. Currently, these computational approaches can be divided into three categories: similarity-based methods, machine learning-based methods and graph-based methods [6].
Similarity-based methods focus on calculating the similarities between miRNAs and between diseases. To this end, miRNAs and diseases are represented as feature vectors built from multiple types of biological data, such as miRNA sequences, miRNA annotations and gene annotations. These methods generally assume that functionally similar miRNAs tend to be associated with diseases that share similar phenotypes [7]. Up to now, various similarity calculation models such as disease semantic similarity [8,9], miRNA functional similarity [10] and Gaussian Interaction Profile (GIP) kernel similarity [11] have been proposed to measure these similarities effectively. For example, Jiang [12] put forward a novel kernel-based fusion strategy to integrate multiple similarities of miRNAs and diseases and then predict their potential associations. GATMDA [13] treated lncRNAs as mediators and put forward the lncRNA-miRNA-disease regulatory mechanism to enhance the similarity calculation of miRNAs and diseases. Besides, Yu et al. [14] employed miRNA target genes with GO annotations to systematically measure the similarity between miRNAs. To fully use global network similarity measures, Chen et al. [15] employed the Random Walk with Restart (RWR) algorithm on the miRNA-miRNA functional similarity network to predict miRNA-disease associations. MFSP [16] inferred the functional similarity of miRNAs with pathway-based miRNA-miRNA relations and measured miRNA similarities between their corresponding associated disease sets. Mørk et al. [17] constructed a miRNA-target-disease heterogeneous network and treated proteins as the intermediary to measure the similarities between miRNAs and diseases. Although the approaches above can measure the similarity between miRNAs or diseases, their calculation models are relatively simple. More importantly, they pay little attention to the complex relationships of miRNAs and diseases in the heterogeneous network, which could limit the accuracy of the similarity calculation.
Machine learning-based approaches usually employ efficient models such as regularized learning, Random Walk, Support Vector Machine (SVM [18]) or decision trees to discover potential miRNA-disease associations. These methods usually have two steps: a training step and an inference step [19]. For example, LRLSHMDA [20] was a semi-supervised model that employed a Laplacian regularized least squares classifier and effectively utilized the implicit information of vertices and edges to infer microbe-disease associations. AMVML [21] was an adaptive learning-based approach that learned novel similarity graphs for miRNAs and diseases from different views and predicted the miRNA-disease associations. EDTMDA [22] put forward a computational framework that adopted a dimensionality reduction strategy to remove redundant features and employed multiple decision trees to infer the association relationships. MTLMDA [23] employed the multi-task learning technique, which exploits both miRNA-disease and gene-disease networks at the same time to complete the association prediction task. EGBMMDA [24] trained a regression tree in a gradient boosting framework and was the first decision tree-based model used to predict miRNA-disease associations. RFMDA [25] selected robust features with a filter-based feature selection strategy and adopted the Random Forest (RF) model as a classifier to infer the association relationships between miRNAs and diseases. Meanwhile, DNRLMF-MDA [11] calculated association probabilities through logistic matrix factorization and further improved predictive performance through dynamic neighborhood regularization. DTINet [26] adopted the RWR algorithm to obtain initial features, compressed them with DCA and applied a matrix completion approach to calculate projections between different nodes. CCL-DTI [27] treated multimodal knowledge as input and investigated the effectiveness of different contrastive losses on the prediction model. At the same time, DeepCompoundNet [28] considered multi-source information on chemicals and proteins, such as protein features and drug properties, to predict drug-target interactions. However, the models above mainly have two drawbacks: (1) they usually simply concatenate the embeddings of miRNAs and diseases learned from different sources, and (2) they fail to extract discriminative embeddings of miRNAs and diseases from networks with rich semantic information, which limits their ability to learn representations.
Meanwhile, graph-based methods have attracted more and more attention due to their outstanding performance. These approaches construct a heterogeneous graph based on the known miRNA-disease association relationships as well as the similarities of miRNAs and diseases [29]. The heterogeneous graph has a powerful ability to depict the complex relationships between miRNAs and diseases and has been commonly employed in miRNA-disease association prediction tasks. For example, GCN and GAT models have been widely adopted in the field of graph learning due to their strong performance [30]. Recently, MMGCN [31] utilized GCN encoders to obtain features of miRNAs and diseases under different similarity views and then enhanced the representation learning process through multi-channel attention mechanisms. GCNDTI [32] constructed a drug-protein pair network and treated node pairs as independent nodes, which transformed the link prediction problem into a node pair classification problem. MAGCN [33] introduced lncRNA-miRNA interactions and miRNA-disease associations to represent miRNAs and diseases and then made MDA predictions via graph convolution networks with a multi-channel attention mechanism and a convolutional neural network combiner. MKGAT [34] applied multi-layer GAT to update miRNA or disease features and then fused them through the attention mechanism. AMHMDA [35] first constructed a heterogeneous hyper-graph and applied the attention-aware multi-view similarity strategy to learn the embeddings of miRNAs and diseases in the constructed hyper-graph. DTIHNC [36] utilized GAT and RWR models to learn direct neighbor and multi-hop neighbor information, respectively, and then applied a multi-level CNN to fuse node features. HGIMDA [37] adopted a graph neural network-based encoder to aggregate node neighborhood information and obtain low-dimensional node embeddings for association prediction. SFAGE [6] optimized the original features through random walks and added a reinforcement layer in the hidden layer of the graph convolutional network to preserve the similarity information in the feature space. AutoEdge-CCP [38] modeled circRNA-cancer-drug interactions by employing a multi-source heterogeneous network, where each molecule combines intrinsic attribute information. However, these approaches do not pay much attention to the multiple meta-path-based relationships between miRNAs and diseases, although these paths usually contain rich semantics that contribute crucially to the learning of miRNA and disease representations. Besides, they select the negative samples from the unlabeled sample set purely at random, which affects their training performance.
Generally, heterogeneous networks pose certain difficulties for the representation learning of nodes due to their complicated structure [39]. Hence, it is a great challenge to design an automated learning framework that fully explores the complex meta-path-based relationships over heterogeneous graphs for embedding learning [40]. In addition, it is difficult to obtain high-quality negative samples for the miRNA-disease association prediction task. Consequently, many current approaches randomly select negative samples from the unlabeled sample set, which often contains numerous false negatives and thus lowers the accuracy of the prediction models. Therefore, selecting highly reliable, high-quality negative training samples is of great significance for miRNA-disease association prediction tasks.
Figure 1. The overall architecture of MGCNSS, which mainly has four steps. (A) MGCNSS establishes the integrated miRNA and disease similarity matrices by fusing their different types of similarities. Then, the RWR algorithm is applied to learn the initial feature matrix $H^{(0)}$ for miRNAs and diseases. (B) MGCNSS adopts a multi-layer graph convolution module to capture the rich semantics of meta-paths with different lengths and learns the discriminative embeddings of miRNAs and diseases. (C) MGCNSS employs cosine-distance and Euclidean-distance-based strategies to select high-quality negative samples. (D) MGCNSS predicts the miRNA-disease associations with their learned embeddings.
In summary, we propose a novel prediction model with multi-layer graph convolution and a negative sample selection strategy (named MGCNSS). Firstly, we collect multiple types of similarity and obtain the integrated miRNA and disease similarity networks to fully capture the similarity between miRNAs or diseases. Then, we adopt the multi-layer graph convolution module to automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases and learn their embeddings from different layers. After that, we adopt the distance-based negative sample selection strategy to screen out high-quality negative training samples. Finally, MGCNSS predicts the miRNA-disease associations with the learned embeddings of miRNAs and diseases. The workflow is displayed in Figure 1. We summarize the contributions of this study as follows:
• MGCNSS can automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases for learning their discriminative embeddings.

Dataset
In this study, we collected the experimental data from HMDD v2.0 [41], which has 495 miRNAs, 383 diseases and 5430 experimentally verified miRNA-disease associations. From these data, MGCNSS constructs the miRNA-disease association network and the corresponding association matrix, denoted as $A = [A_{ij}] \in \mathbb{R}^{N_m \times N_d}$, where $N_m$ and $N_d$ represent the number of miRNAs and diseases, respectively. In matrix $A$, $A_{ij} = 1$ indicates that there is an association relationship between $m_i$ and $d_j$, whereas $A_{ij} = 0$ indicates that no association between $m_i$ and $d_j$ is known.

MiRNA functional similarity
We downloaded the miRNA functional similarity scores from HMDD [42]. In this dataset, the functional similarities between all miRNAs are represented in one matrix $FM \in \mathbb{R}^{N_m \times N_m}$, and $FM(m_i, m_j)$ denotes the similarity value between miRNA $m_i$ and miRNA $m_j$. A higher score indicates a higher level of similarity between the two miRNAs.

Disease semantic similarity
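The relationships between diseases are well described by the Medical Subject Headings (MeSH) descriptors [43], from which a directed acyclic graph (DAG) can be constructed for each disease. MGCNSS measures the disease semantic similarity based on this DAG. Specifically, MGCNSS calculates the semantic contribution of disease $d$ to $D$ according to the following formula:

$$C1_D(d) = \begin{cases} 1, & d = D \\ \max\{\Delta \cdot C1_D(d') \mid d' \in \text{children of } d\}, & d \neq D \end{cases} \tag{1}$$

where $d$ is an ancestor node of $D$, and $\Delta$ represents the semantic contribution decay factor, which is usually 0.5. According to equation (1), the semantic value of $D$ can be obtained by

$$DSM1(D) = \sum_{d \in T(D)} C1_D(d) \tag{2}$$

where $T(D)$ denotes the ancestor node set of $D$ including $D$ itself. Therefore, the semantic similarity between disease $d_i$ and disease $d_j$ can be calculated as follows:

$$FD1(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left(C1_{d_i}(t) + C1_{d_j}(t)\right)}{DSM1(d_i) + DSM1(d_j)} \tag{3}$$

Besides, if disease $d$ occurs only in the DAG of one disease $D$, but not in the DAGs of other diseases, it is necessary to increase the contribution score of disease $d$ to $D$ [44]. Therefore, MGCNSS also measures the semantic contribution of disease $d$ to $D$ in a second way, which is formulated as follows:

$$C2_D(d) = -\log\left(\frac{\text{the number of DAGs including } d}{\text{the total number of diseases}}\right) \tag{4}$$

Similar to the calculation strategy of FD1, we can formulate the equations for DSM2 and FD2:

$$DSM2(D) = \sum_{d \in T(D)} C2_D(d) \tag{5}$$

$$FD2(d_i, d_j) = \frac{\sum_{t \in T(d_i) \cap T(d_j)} \left(C2_{d_i}(t) + C2_{d_j}(t)\right)}{DSM2(d_i) + DSM2(d_j)} \tag{6}$$

Finally, these two types of calculation approaches are combined:

$$FD = \frac{FD1 + FD2}{2} \tag{7}$$

where FD is the disease semantic similarity matrix.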
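The first similarity measure can be illustrated with a minimal sketch. It assumes the standard formulation in which a disease contributes 1 to itself and each step towards an ancestor multiplies the contribution by the decay factor 0.5, keeping the maximum over all paths. The toy DAG and the function names `contributions`, `semantic_value` and `similarity` are hypothetical and are not taken from the MGCNSS source code.

```python
# Hypothetical toy DAG: each disease maps to its direct parents (more general terms).
PARENTS = {
    "colon neoplasms": ["colorectal neoplasms"],
    "rectal neoplasms": ["colorectal neoplasms"],
    "colorectal neoplasms": ["intestinal neoplasms"],
    "intestinal neoplasms": [],
}
DELTA = 0.5  # semantic contribution decay factor


def contributions(disease):
    """Map every term in T(disease) to its semantic contribution to `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in PARENTS[node]:
            value = DELTA * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_value(disease):
    """Sum of the contributions of all terms in the DAG of `disease`."""
    return sum(contributions(disease).values())


def similarity(d_i, d_j):
    """Shared contributions divided by the sum of both semantic values."""
    c_i, c_j = contributions(d_i), contributions(d_j)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (semantic_value(d_i) + semantic_value(d_j))


print(similarity("colon neoplasms", "rectal neoplasms"))
```

In this toy DAG the two diseases share the ancestors "colorectal neoplasms" and "intestinal neoplasms", so the similarity is (1.0 + 0.5) / (1.75 + 1.75). The second measure would only change how the per-term contribution is computed.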

GIP kernel similarity
Similar to previous research [45], MGCNSS measures the GIP kernel similarity [11] for miRNAs and diseases based on the miRNA-disease adjacency matrix $A$. Take building the GIP kernel similarity for miRNAs (GM) as an example. Firstly, the $i$th and $j$th rows of the adjacency matrix $A$ are treated as the disease interaction profiles of miRNAs $m_i$ and $m_j$, which are denoted as $R_i$ and $R_j$ [10]. Then, we measure the GIP kernel similarity between miRNA $m_i$ and miRNA $m_j$ based on $R_i$ and $R_j$. The corresponding similarity is constructed by

$$GM(m_i, m_j) = \exp\left(-\alpha_m \|R_i - R_j\|^2\right) \tag{8}$$

where $\alpha_m$ controls the bandwidth of the kernel and can be calculated as follows:

$$\alpha_m = \frac{\alpha'_m}{\frac{1}{N_m}\sum_{i=1}^{N_m} \|R_i\|^2} \tag{9}$$

where $N_m$ represents the number of miRNAs, and $\alpha'_m$ is usually set to 1 following previous studies. In a similar manner, MGCNSS establishes the similarity matrix GD for diseases. Specifically, MGCNSS treats the columns of matrix $A$ as the miRNA interaction profiles of the corresponding diseases. Without loss of generality, the relationship vectors for $d_i$ and $d_j$ are represented as $C_i$ and $C_j$, and their similarity is formulated as follows:

$$GD(d_i, d_j) = \exp\left(-\beta_d \|C_i - C_j\|^2\right) \tag{10}$$

$$\beta_d = \frac{\beta'_d}{\frac{1}{N_d}\sum_{i=1}^{N_d} \|C_i\|^2} \tag{11}$$

where $N_d$ represents the number of diseases and $\beta'_d$ is also set to 1.
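As an illustration, the GIP kernel can be computed in a few lines of NumPy. This is a sketch under the usual definition of the kernel (a Gaussian of the squared distance between interaction profiles, with the bandwidth normalised by the mean squared profile norm); the function name `gip_kernel` and the toy association matrix are hypothetical.

```python
import numpy as np


def gip_kernel(profiles, bandwidth_prime=1.0):
    """GIP kernel similarity between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Normalise the bandwidth by the average squared norm of the profiles.
    bandwidth = bandwidth_prime / sq_norms.mean()
    # Squared Euclidean distance between every pair of profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-bandwidth * sq_dists)


# Toy association matrix: 4 miRNAs (rows) x 3 diseases (columns).
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])

GM = gip_kernel(A)    # miRNA-miRNA similarity from the rows of A
GD = gip_kernel(A.T)  # disease-disease similarity from the columns of A
print(GM.round(3))
```

Passing `A.T` reuses the same function for diseases, since the columns of the association matrix serve as their interaction profiles.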

lncRNA-based similarity
Many studies have shown that lncRNAs participate in various biological processes, including DNA methylation, post-transcriptional regulation of RNA and protein translation regulation. As a result, lncRNAs have association relationships with both miRNAs and diseases. MGCNSS uses these association relationships to measure the similarity among miRNAs and among diseases separately. The raw data are downloaded from the starBase v2.0 database [46], and the miRNA-lncRNA association matrix and disease-lncRNA association matrix are obtained from GATMDA [13]. Finally, we adopt an edit-distance algorithm [47] to obtain the lncRNA-based similarity matrices for miRNAs and diseases, named $LM \in \mathbb{R}^{N_m \times N_m}$ and $LD \in \mathbb{R}^{N_d \times N_d}$, respectively.

MGCNSS model
The overall architecture of MGCNSS is shown in Figure 1. The model can be summarized in four main parts: multi-source data fusion, multi-layer graph convolution, negative sample selection and model prediction.

Multi-source data fusion
We have now obtained the miRNA functional similarity matrix FM, the miRNA Gaussian kernel-based similarity matrix GM and the lncRNA-based miRNA similarity matrix LM. Next, MGCNSS builds an integrated miRNA similarity matrix named IM from FM, GM and LM, which is formulated as follows:

$$IM = \alpha_1 FM + \alpha_2 GM + \alpha_3 LM \tag{12}$$

where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are hyper-parameters. Likewise, MGCNSS combines the disease semantic similarity matrix FD, the Gaussian kernel-based disease similarity matrix GD and the lncRNA-based disease similarity matrix LD. The integrated disease similarity matrix ID is denoted as follows:

$$ID = \beta_1 FD + \beta_2 GD + \beta_3 LD \tag{13}$$

where $\beta_1$, $\beta_2$ and $\beta_3$ are hyper-parameters. These hyper-parameters are investigated in the Results section. MGCNSS then constructs the heterogeneous miRNA-disease association network $N_{hete}$, which is used by the multi-layer graph convolution module. The matrix representation of $N_{hete}$ is denoted as $M$, which is formulated as follows:

$$M = \begin{bmatrix} IM & A \\ A^{T} & ID \end{bmatrix} \tag{14}$$

where $M \in \mathbb{R}^{(N_d+N_m)\times(N_d+N_m)}$. Finally, MGCNSS initializes the features of miRNAs and diseases with the RWR algorithm [36] based on the adjacency matrix $A$. Specifically, the RWR algorithm is formulated as follows:

$$D_t^{(k+1)} = (1 - r)\, A_N D_t^{(k)} + r D_t^{(0)} \tag{15}$$

where $r$ represents the restart probability and $k$ indexes the iterations. $D_t^{(0)} \in \mathbb{R}^{(N_d+N_m)\times 1}$ is the initial vector of the $t$th node. Besides, $A_N$ is the normalized adjacency matrix. The algorithm is run until $\|D^{(k+1)} - D^{(k)}\|_F < 10^{-6}$, and $D_t^{(k)}$ is the output feature of the $t$th node at the $k$th iteration. In this way, the initial features of all miRNAs and diseases are obtained and form the feature matrix denoted as $D^{k}$.
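The fusion and initialisation steps can be sketched as follows. Several details are assumptions made only for illustration: the fusion is taken to be a weighted sum, the walk is run on the column-normalised heterogeneous matrix (so that the dimensions of the node vectors match), the initial vectors are one-hot, and the weights and restart probability are placeholder values. The helper names `fuse`, `heterogeneous_matrix` and `rwr_features` are hypothetical.

```python
import numpy as np


def fuse(similarities, weights):
    """Weighted sum of several similarity matrices of the same shape."""
    return sum(w * s for w, s in zip(weights, similarities))


def heterogeneous_matrix(IM, ID, A):
    """Stack miRNA similarity, associations and disease similarity into one matrix."""
    return np.block([[IM, A], [A.T, ID]])


def rwr_features(M, restart=0.5, tol=1e-6, max_iter=1000):
    """Random walk with restart from every node; column t is the vector of node t."""
    col_sums = M.sum(axis=0, keepdims=True)
    M_norm = M / np.where(col_sums == 0, 1.0, col_sums)  # column-stochastic
    D0 = np.eye(M.shape[0])  # one-hot start vector per node
    D = D0.copy()
    for _ in range(max_iter):
        D_next = (1.0 - restart) * M_norm @ D + restart * D0
        if np.linalg.norm(D_next - D, ord="fro") < tol:
            return D_next
        D = D_next
    return D


rng = np.random.default_rng(0)


def random_similarity(n):
    """Random symmetric matrix with unit diagonal, standing in for real data."""
    S = rng.random((n, n))
    S = (S + S.T) / 2
    np.fill_diagonal(S, 1.0)
    return S


A = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)  # 3 miRNAs x 2 diseases
IM = fuse([random_similarity(3) for _ in range(3)], [0.4, 0.3, 0.3])
ID = fuse([random_similarity(2) for _ in range(3)], [0.4, 0.3, 0.3])
M = heterogeneous_matrix(IM, ID, A)
H0 = rwr_features(M)
print(H0.shape)
```

With column normalisation every column of the result remains a probability distribution over the nodes, which is a convenient sanity check on the iteration.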

Multilayer graph convolution
Generally, meta-paths have a powerful ability to capture multiple relationships between nodes in the heterogeneous network. It is essential to comprehensively explore these rich and complex semantics to learn the embeddings of miRNAs and diseases. Here we employ multi-layer graph convolution to learn their embeddings.
As shown in Figure 1(B), the graph convolutional module consists of multiple graph convolutional layers, which can capture the semantics of meta-paths with different lengths. Specifically, the first layer of convolution can be represented as follows:

$$H^{(1)} = M \cdot H^{(0)} \cdot W^{(1)} \tag{16}$$

where $H^{(1)} \in \mathbb{R}^{(N_m+N_d)\times d}$ is the feature matrix for meta-paths with length 1, $H^{(0)} = D^{k}$ is the input feature matrix obtained from RWR, and $W^{(1)} \in \mathbb{R}^{(N_m+N_d)\times d}$ is the learnable weight matrix, where $d$ is the embedding size of the output features of miRNAs and diseases. The two-layer convolution is formulated as follows:

$$H^{(2)} = M \cdot H^{(1)} \cdot W^{(2)} \tag{17}$$

where $H^{(2)} \in \mathbb{R}^{(N_m+N_d)\times d}$ and $W^{(2)} \in \mathbb{R}^{d\times d}$. Similarly, the $l$-layer convolution is denoted as follows:

$$H^{(l)} = M \cdot H^{(l-1)} \cdot W^{(l)} \tag{18}$$

Finally, MGCNSS establishes $l$ feature matrices, corresponding to meta-paths with $l$ different lengths. Notably, lower-order convolutional layers tend to focus on the direct neighbors of miRNAs or diseases, while higher-order convolutional layers can capture long-distance relationships. Then, we apply attention coefficients to combine the $l$ feature matrices with different weights:

$$H = \sum_{i=1}^{l} a_i H^{(i)}$$

where $a_i$ is the attention coefficient of the $i$th layer, and $H$ is the final output of the multi-layer graph convolution module and the ultimate feature matrix of miRNAs and diseases.
Figure 2 is a toy example demonstrating the process of the multi-layer graph convolution in capturing the semantics of meta-paths with length 2 automatically.
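The following NumPy sketch shows why stacking layers captures longer meta-paths. It assumes a linear propagation rule without activation, in which each layer multiplies the previous output by the heterogeneous matrix and a weight matrix, and it normalises the attention coefficients with a softmax; both choices are assumptions made for illustration. The name `multi_layer_convolution` and the random weights (standing in for learned parameters) are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def multi_layer_convolution(M, H0, weights, attention_logits):
    """Return the attention-weighted sum of all layer outputs and the outputs themselves."""
    layer_outputs, H = [], H0
    for W in weights:
        H = M @ H @ W  # one more hop along the heterogeneous network
        layer_outputs.append(H)
    coeffs = softmax(np.asarray(attention_logits, dtype=float))
    H_final = sum(a * H_l for a, H_l in zip(coeffs, layer_outputs))
    return H_final, layer_outputs


rng = np.random.default_rng(0)
n_nodes, d = 5, 4
M = rng.random((n_nodes, n_nodes))
M = (M + M.T) / 2                  # toy symmetric heterogeneous matrix
H0 = np.eye(n_nodes)               # stand-in for the RWR feature matrix
W1 = rng.normal(size=(n_nodes, d)) # first-layer weights
W2 = rng.normal(size=(d, d))       # second-layer weights

H, layers = multi_layer_convolution(M, H0, [W1, W2], attention_logits=[0.0, 0.0])

# The second layer depends on M @ M, i.e. on all length-2 paths between nodes.
print(np.allclose(layers[1], M @ M @ H0 @ W1 @ W2))
```

Because the second-layer output contains the product of the matrix with itself, every entry aggregates information over all two-step paths, such as miRNA-miRNA-disease or miRNA-disease-disease, without enumerating them by hand.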

Negative sample selection strategy
In association prediction tasks, the quality of negative samples affects the performance of the prediction models [48]. Positive samples can be collected directly, but obtaining ground-truth negative samples is challenging [49]. Therefore, most current approaches treat the known miRNA-disease associations as positive samples and regard the remaining pairs as unlabeled samples. Moreover, previous research usually selects a certain number of unlabeled samples at random to form the negative sample set [50]. This selection strategy may introduce noisy samples, i.e. true associations mislabeled as negatives, which can interfere with model training and reduce prediction accuracy.
To solve this problem, we propose a distance-based negative sample selection strategy that uses both cosine distance and Euclidean distance to select reliable negative samples. Firstly, following the k-means clustering algorithm [51], we generate a centroid vector $C_p$ for the positive sample set $P_{set}$ and a centroid vector $C_u$ for the unlabeled sample set $U_{set}$. Specifically, $C_p$ and $C_u$ are calculated by the following formulas:

$$C_p = \frac{1}{|P_{set}|}\sum_{v_i \in P_{set}} F_{v_i} \tag{21}$$

$$C_u = \frac{1}{|U_{set}|}\sum_{v_i \in U_{set}} F_{v_i} \tag{22}$$

where $v_i$ denotes the $i$th miRNA-disease pair and $F_{v_i}$ denotes its vector representation. Specifically, the representation of each pair in the positive sample set is formed by concatenating the feature vectors of its miRNA and its disease. Similarly, the representations of the pairs in the unlabeled sample set are formed by concatenation. Next, we compare the cosine similarity (CS) between each unlabeled sample and the two centroid vectors $C_p$ and $C_u$. For one sample $v_i \in U_{set}$, if $CS_p$ is greater, we put it into a potential positive sample set $P_l$. Conversely, if $CS_u$ is greater, it is put into a potential negative sample set $N_l$. $CS_p$ and $CS_u$ are defined as follows:

$$CS_p = \frac{F_{v_i} \cdot C_p}{\|F_{v_i}\|\,\|C_p\|}, \qquad CS_u = \frac{F_{v_i} \cdot C_u}{\|F_{v_i}\|\,\|C_u\|}$$

In this manner, we obtain the sets $P_l$ and $N_l$ and calculate their new centroid vectors $C_p$ and $C_u$ with Equations (21) and (22), respectively. Then we compare the samples $v_i \in U_{set}$ with the new centroid vectors $C_p$ and $C_u$ by the Euclidean similarity (ES) measurement and divide them accordingly. The ES is calculated as follows:

$$ES_p = \|F_{v_i} - C_p\|_2, \qquad ES_u = \|F_{v_i} - C_u\|_2$$

MGCNSS divides the samples in $U_{set}$ into $P_l$ and $N_l$ according to their $ES_p$ and $ES_u$ values and starts the next iteration. This process is repeated until the centroids converge, that is, until the change of $C_p$ and $C_u$ between two consecutive iterations, measured with the Frobenius norm $\|\cdot\|$, becomes negligible. The $N_l$ obtained in the last iteration is treated as the final reliable negative sample set. The selection process is shown in Figure 3 and the algorithm is presented in Algorithm 1.
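The selection procedure can be sketched as a small k-means-like loop. This sketch makes several assumptions that the text does not fix: centroids are plain means, the Euclidean measure is a distance (so a sample joins the nearer centroid), convergence is tested against a small tolerance, and the loop stops early if one of the two sets becomes empty. The function name `select_negatives` and the toy feature vectors are hypothetical.

```python
import numpy as np


def cosine(F, c):
    """Cosine similarity between each row of F and the vector c."""
    return F @ c / (np.linalg.norm(F, axis=1) * np.linalg.norm(c) + 1e-12)


def select_negatives(F_pos, F_unl, tol=1e-6, max_iter=100):
    """Return the indices of unlabeled samples kept as reliable negatives."""
    # Initial centroids of the positive and the unlabeled sample sets.
    C_p, C_u = F_pos.mean(axis=0), F_unl.mean(axis=0)
    # First split of the unlabeled samples uses cosine similarity.
    likely_pos = cosine(F_unl, C_p) > cosine(F_unl, C_u)
    for _ in range(max_iter):
        if likely_pos.all() or not likely_pos.any():
            break  # one set is empty, so its centroid is undefined
        new_C_p = F_unl[likely_pos].mean(axis=0)
        new_C_u = F_unl[~likely_pos].mean(axis=0)
        shift = np.linalg.norm(new_C_p - C_p) + np.linalg.norm(new_C_u - C_u)
        C_p, C_u = new_C_p, new_C_u
        if shift < tol:
            break  # centroids have converged
        # Later splits use the Euclidean distance to the updated centroids.
        dist_p = np.linalg.norm(F_unl - C_p, axis=1)
        dist_u = np.linalg.norm(F_unl - C_u, axis=1)
        likely_pos = dist_p < dist_u
    return np.flatnonzero(~likely_pos)


# Toy pair representations: two unlabeled pairs resemble the positives.
F_pos = np.array([[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])
F_unl = np.array([[0.95, 1.0], [1.0, 0.95],
                  [-1.0, -0.9], [-0.9, -1.0], [-1.0, -1.0], [-0.8, -1.0]])

print(select_negatives(F_pos, F_unl).tolist())
```

In the toy data the first two unlabeled pairs lie close to the positive centroid and are therefore excluded, while the remaining four are returned as reliable negatives.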

Model training
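Finally, we train MGCNSS by minimizing the binary cross-entropy loss function to optimize the model parameters, which is formulated as follows:

$$\mathcal{L} = -\sum_{(m,d)} \left[ y_{md} \log \sigma(\langle H_m, H_d \rangle) + (1 - y_{md}) \log\left(1 - \sigma(\langle H_m, H_d \rangle)\right) \right]$$

where the sum runs over the miRNA-disease pairs in the training set, $y_{md}$ is the label of the pair, $H_m$ represents the $m$th row vector of the feature matrix obtained from the convolutional layers, $\sigma$ denotes the sigmoid function, and $\langle H_m, H_d \rangle$ represents the inner product of $H_m$ and $H_d$.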
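A minimal NumPy sketch of this loss is given below. It assumes that the rows of the feature matrix list the miRNAs first and the diseases afterwards, and it averages over the pairs instead of summing, which only rescales the loss. The name `bce_loss` and the toy embeddings are hypothetical.

```python
import numpy as np


def bce_loss(H, pairs, labels, n_mirna):
    """Binary cross-entropy on sigmoid(inner product) of miRNA and disease rows."""
    pairs = np.asarray(pairs)
    labels = np.asarray(labels, dtype=float)
    mirna_rows = H[pairs[:, 0]]
    disease_rows = H[pairs[:, 1] + n_mirna]  # diseases follow the miRNAs in H
    scores = np.einsum("ij,ij->i", mirna_rows, disease_rows)  # row-wise inner product
    probs = 1.0 / (1.0 + np.exp(-scores))
    probs = np.clip(probs, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))


# Toy embeddings: 3 miRNAs followed by 2 diseases, embedding size 2.
H = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],
              [2.0, 0.0], [0.0, 2.0]])
pairs = [(0, 0), (1, 1), (2, 0)]  # (miRNA index, disease index)
labels = [1, 1, 0]

loss = bce_loss(H, pairs, labels, n_mirna=3)
print(round(float(loss), 4))
```

Each positive pair here has an inner product of 2 and the negative pair has -2, so all three terms contribute the same small penalty; an all-zero feature matrix would instead give the chance-level loss of log 2.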

Time complexity analysis
Here we analyze the time complexity of MGCNSS. As shown in Figure 1, MGCNSS mainly has three modules: multi-source data fusion, multi-layer graph convolution and negative sample selection. Suppose that there are m miRNAs and n diseases. MGCNSS has to measure the similarities between all the miRNAs and diseases, and the time complexity is O(m

RESULTS
In this section, we first briefly introduce the implementation details and evaluation metrics used in this study. Then the comparison results of MGCNSS and the baselines are presented. After that, the ablation experiments and the parameter sensitivity experiments are reported. Lastly, the case studies are presented.

Implementation details and evaluation metrics
For MGCNSS, the embedding size of the miRNAs and diseases for prediction is 256, the learning rate is 0.0005 and the weight decay is 0.0005. Besides, the number of training epochs is 2000 in all experiments. In the multi-layer graph convolution module, the number of convolution layers l is 2. Meanwhile, to comprehensively evaluate the performance of the prediction models, we establish two types of experimental datasets according to the ratio between the number of positive and negative pairs. Specifically, on the balanced dataset, the ratio between the number of positive and negative samples is 1:1. On the imbalanced datasets, the ratio between the number of positive and negative samples is 1:5 or 1:10.
Finally, we employ widely used evaluation metrics [52] to evaluate the performance of MGCNSS and the comparison approaches, namely accuracy (ACC), area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (AUPR). Specifically, AUC is commonly employed to evaluate the stability of the prediction model, while ACC measures the proportion of samples that the model classifies correctly. AUPR is another crucial metric for evaluating the performance of prediction models on imbalanced datasets.
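For reference, the three metrics can be computed from labels and predicted scores as in the following sketch. It uses the pairwise-ranking form of AUC and approximates AUPR by average precision; the 0.5 decision threshold for ACC is an assumption, as are the function names and the toy scores.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(y_true, y_score))
    return correct / len(y_true)


def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive sample."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(accuracy(y_true, y_score), auc(y_true, y_score), average_precision(y_true, y_score))
```

On this toy ranking one negative is scored above one positive, which lowers AUC to 8/9 and average precision to 2.75/3, while ACC at the assumed threshold is 5/6.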
Besides, we employ the 5-fold cross-validation strategy (5-CV) to further evaluate the performance of MGCNSS. Figure 4 demonstrates the process of 5-CV on the balanced dataset. MGCNSS first employs the negative sample selection strategy to choose the likely negative samples from all the unlabeled samples. Then MGCNSS selects M samples from the likely negative sample set, where M is also the number of positive samples. After that, MGCNSS randomly divides the positive and selected negative samples into five folds and performs the 5-CV experiment. Each fold is used in turn as the test set to evaluate the performance of MGCNSS. Finally, MGCNSS adopts the average value over the five folds as the evaluation result for the corresponding epoch. The description above is the execution process for a single epoch. As the number of epochs increases, the BCE loss converges and MGCNSS obtains its best performance and the corresponding hyperparameters.
Moreover, the hyperparameters used in this study are selected on the training dataset. The negative sample selection strategy is applied to the training dataset as well as the test dataset. Besides, the results presented in the parameter sensitivity analysis section are derived from the test dataset.
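The sampling and splitting steps described above can be sketched with the standard library alone. The function name `cross_validation_folds`, the `ratio` parameter (1 for the balanced setting, 5 or 10 for the imbalanced ones) and the toy pairs are hypothetical; the sketch also assumes that the likely negative set is large enough to draw the requested number of samples.

```python
import random


def cross_validation_folds(positives, likely_negatives, ratio=1, n_folds=5, seed=0):
    """Yield (train, test) lists of (pair, label) tuples for each fold."""
    rng = random.Random(seed)
    # Draw `ratio` times as many negatives as there are positives.
    negatives = rng.sample(likely_negatives, ratio * len(positives))
    samples = [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]
    rng.shuffle(samples)
    folds = [samples[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, folds[k]


# Toy data: (miRNA index, disease index) pairs.
positives = [(m, m % 4) for m in range(10)]
likely_negatives = [(m, 4 + m % 3) for m in range(40)]

splits = list(cross_validation_folds(positives, likely_negatives, ratio=1))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the metric over the five folds uses every positive and every selected negative once for evaluation.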

Comparison with other baseline models
Here, we select eleven competitive methods to compare with the proposed model, which are SVM [18], RF [53], XGBoost [54], GCN [55], GAT [56], DTIGAT [57], DTICNN [58], NeoDTI [59], MSGCL [3], AMHMDA [35] and GATMDA [13]:
• SVM [18]: a classic supervised learning algorithm; we feed the learned embeddings of miRNAs and diseases directly for MDA prediction.
• RF [53]: an ensemble learning method that combines multiple decision trees for MDA prediction.
• XGBoost [54]: a widely used gradient boosting framework, in which the features of miRNAs and diseases are fed directly for MDA prediction.
• GCN [55]: a neural network architecture designed for graph-structured data, which is employed to learn the embeddings of miRNAs and diseases for MDA prediction.
• GAT [56]: a neural network architecture that utilizes the attention mechanism in the feature learning process.
• DTIGAT [57]: an end-to-end framework that assigns different weights to node neighbors with self-attention mechanisms for DTI prediction.
• DTICNN [58]: adopts the RWR algorithm to extract features and employs a denoising auto-encoder for dimensionality reduction for MDA prediction.
• NeoDTI [59]: an end-to-end model that integrates different information and automatically learns topology-preserving representations.
• MSGCL [3]: adopts multi-view self-supervised contrastive learning for MDA prediction, which enhances the latent representation by maximizing the consistency between different views.
• AMHMDA [35]: applies the attention-aware multi-view similarity strategy to learn the embeddings of nodes from the heterogeneous hyper-graph to predict miRNA-disease associations.
• GATMDA [13]: fuses both linear and non-linear embeddings of miRNAs and diseases and adopts the RF model to complete the prediction task.
We first perform the comparison experiment on the balanced dataset, in which the ratio between the number of positive and negative samples is 1:1. The results are shown in Table 1. It can be observed that MGCNSS outperforms all baseline methods in this scenario. Specifically, the results of MGCNSS on the AUC, ACC and AUPR metrics are 0.9874, 0.9453 and 0.9882, respectively. XGBoost ranks second on AUC and AUPR, with values of 0.9353 and 0.9355, respectively. Meanwhile, GAT obtains the second highest ACC, with a value of 0.8647.
Moreover, we vary the ratio between the number of positive and negative samples to 1:5 and 1:10. The corresponding results are listed in Table 2. From the results, we can see that the proposed method achieves the best performance on all the evaluation metrics. Specifically, the results of MGCNSS with the 1:5 ratio are 0.9861, 0.9586 and 0.9758 on the AUC, ACC and AUPR metrics, while those of MGCNSS with the 1:10 ratio are 0.9871, 0.9786 and 0.9385 on the corresponding metrics. XGBoost ranks second on AUC and AUPR with both the 1:5 and 1:10 ratios, and on ACC with the 1:10 ratio. Meanwhile, DTIGAT ranks second on ACC with the 1:5 ratio.

Table 1: The evaluation results of MGCNSS and baseline methods with the 1:1 ratio on AUC, ACC and AUPR metrics.
The results demonstrate that the performance of MGCNSS w/o NSST is inferior to that of MGCNSS. For example, in Table 1, the values of MGCNSS w/o NSST on AUC, ACC and AUPR are 0.9437, 0.8859 and 0.9125, which are lower than those of MGCNSS by 3.1, 6.3 and 7.6%, respectively. The metric values of MGCNSS w/o NSST in Table 2 are also inferior to those of MGCNSS and are not repeated here. In conclusion, the results of this ablation experiment, presented in Tables 1 and 2, illustrate that the proposed negative sample selection strategy is essential for the performance of MGCNSS.

Multi-source data fusion, multi-layer graph convolution and negative sample selection
In MGCNSS, there are three essential modules, which are the multiple similarities integration module (denoted as MI; see Figure 1A), the meta-path-based multi-layer graph convolution module (denoted as MP; see Figure 1B) and the negative sample selection module (denoted as SS; see Figure 1C). The corresponding results on the balanced dataset are shown in Figure 5.
Besides, we conducted validation on the imbalanced dataset. Specifically, the experiments were performed with the 1:5 ratio and the results are shown in Figure 6. The results demonstrate that on the imbalanced dataset, the SS module significantly improves the model performance. The values of MGCNSS on the AUC, ACC and AUPR metrics are 0.9871, 0.9671 and 0.9609, respectively. Compared with its variants, the performance of MGCNSS is competitive. The results shown in Figure 6 further show that the SS, MP and MI modules are all essential for improving the prediction accuracy. In particular, the results demonstrate that meta-path-based multi-layer graph convolution plays an essential role in improving the performance of MGCNSS.

Different similarities of miRNAs and diseases in multi-source data fusion
In the multi-source data fusion module (Figure 1A), MGCNSS integrates three types of similarities: miRNA functional similarity and disease semantic similarity (denoted as FSM), GIP kernel similarity of miRNAs and diseases (denoted as GSM) and lncRNA-based similarity of miRNAs and diseases (denoted as LSM). This ablation experiment compares MGCNSS w/o FSM, MGCNSS w/o GSM and MGCNSS w/o LSM with the full MGCNSS, and the corresponding results on the balanced dataset are shown in Figure 7.
The results shown in Figure 7 indicate that MGCNSS achieves the highest scores on the AUC, ACC and AUPR metrics. To be specific, MGCNSS gets the highest scores on the AUC, ACC and AUPR

The performance of MGCNSS based on meta-paths with different lengths
To fully evaluate the effect of meta-paths with different lengths on MGCNSS, we group the meta-path lengths into the combinations {1}, {2}, {3}, {1,2,3} and {1,2}. The corresponding results for each combination are presented in Table 3. MGCNSS with meta-path lengths {1,2} achieves the best performance, with AUC, ACC and AUPR values of 0.9874, 0.9453 and 0.9882, respectively. Besides, MGCNSS with length {1} ranks second on AUC and AUPR; its AUC, ACC and AUPR values are 0.9819, 0.9355 and 0.9835. Meanwhile, the performance of MGCNSS on meta-paths with lengths {1,2,3} is competitive but does not exceed that of {1,2}. This may be because embeddings learned from meta-paths of length 3 contain noise, which decreases the performance of MGCNSS. It is worth noting that meta-paths with lengths {1,2} have the greatest impact on the performance of MGCNSS.

Parameter sensitivity analysis
In this section, we first conduct parameter sensitivity experiments on the learning rate, the embedding size and the number of graph convolution layers. Then, we introduce the procedure for selecting the hyperparameters in Equations (12) and (13).
The first parameter is the learning rate, a hyperparameter that controls how much the model changes in response to the estimated error [60]. Choosing a proper learning rate is crucial for MGCNSS. In this study, we choose the learning rate from 0.0001, 0.0005, 0.001, 0.01 and 0.1, and the corresponding results are shown in Figure 8(A). MGCNSS achieves the best performance on all metrics when the learning rate is 0.0005. The performance of MGCNSS improves when the learning rate increases from 0.0001 to 0.0005, while the metrics decrease when the learning rate increases from 0.0005 to 0.1. As a result, MGCNSS adopts 0.0005 as its learning rate.
The second parameter is the embedding size of miRNAs and diseases, which is essential for the performance of MGCNSS. In this study, we choose the embedding size from 64, 128, 256, 512 and 1024, and the corresponding results are shown in Figure 8(B). MGCNSS achieves its best performance when the embedding size is 256. Therefore, we adopt 256 as the embedding size for MGCNSS.
The third parameter is the number of graph convolution layers, which also affects the performance of MGCNSS. Here we choose the number of graph convolution layers from 1, 2, 3 and 4 and obtain the AUC, ACC and AUPR values. The results shown in Figure 8(C) illustrate that the proposed model achieves the highest scores when the number of graph convolution layers is 2. Notably, the performance of MGCNSS begins to decline when the number of graph convolution layers is larger than 2. A possible reason is that meta-paths of length 1 and 2 already capture the semantics between nodes, while longer paths do not help the embedding learning of miRNAs and diseases. Therefore, the number of graph convolution layers is set to 2.
Besides, for the hyperparameters in Equations (12) and (13), MGCNSS uses the BCE loss to select proper values. As shown in Figure 9, the BCE loss tends to converge as the epoch number increases. Specifically, when the epoch number is larger than 1500, the BCE loss has almost converged and the corresponding AUC, ACC and AUPR values change only within a small range.
Meanwhile, Figure 10 presents the corresponding values of the hyperparameters in Equations (12) and (13) as the epoch number increases. In this study, MGCNSS takes the values of $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\beta_1$, $\beta_2$, $\beta_3$ at epoch 2000. These values are 0.77, 0.19, 0.04 and 0.16, 0.82, 0.02, respectively, which indicates that the miRNA functional similarity network ($\alpha_1$) and the Gaussian kernel-based disease similarity network ($\beta_2$) have relatively higher weights.
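To make the role of these weights concrete, the following minimal sketch fuses three similarity matrices with a weighted sum. This is an illustration only: Equations (12) and (13) are not reproduced in this section, so the weighted-sum form, the function name `fuse_similarities` and the toy matrices are assumptions; only the weight values 0.77, 0.19 and 0.04 come from the text, and in MGCNSS they are obtained during training rather than fixed by hand.

```python
def fuse_similarities(matrices, weights):
    """Element-wise weighted sum of equally sized similarity matrices.

    Assumed form of the fusion step; the paper's Equations (12)/(13)
    may differ in detail.
    """
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(cols)]
        for i in range(rows)
    ]


# Toy 2x2 miRNA similarity matrices (made-up values for illustration).
fsm = [[1.0, 0.2], [0.2, 1.0]]  # functional similarity
gsm = [[1.0, 0.6], [0.6, 1.0]]  # GIP kernel similarity
lsm = [[1.0, 0.4], [0.4, 1.0]]  # lncRNA-based similarity

alphas = (0.77, 0.19, 0.04)     # weights reported at epoch 2000
fused = fuse_similarities([fsm, gsm, lsm], alphas)
```

With these weights the fused off-diagonal entry stays close to the functional similarity value, which is what a dominant $\alpha_1$ implies.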

The performance of MGCNSS under different negative sample selection strategy
The negative sample selection strategy is crucial to the performance of MGCNSS, and other negative sample selection strategies have also been proposed [5,60]. Here, we choose the k-means clustering strategy in [5] and compare it with our distance-based (DB) selection strategy. The corresponding results are displayed in Figure 11.
This study employs the AUC, ACC and AUPR metrics to evaluate the performance of MGCNSS with different negative sample selection strategies. MGCNSS with our DB strategy achieves higher scores: its AUC, ACC and AUPR values are 0.9874, 0.9453 and 0.9882, respectively, while MGCNSS with the k-means strategy gets 0.9366, 0.8729 and 0.9301 on AUC, ACC and AUPR, respectively. The results of this sub-experiment further illustrate the effectiveness of the proposed negative sample selection strategy.

The results of MGCNSS and other approaches under statistical significance test
The statistical significance test is another commonly used way to verify the stability of the results of each prediction model. Here, we employ the paired t-test [61] to perform the significance analysis. Specifically, the results of MGCNSS and the baselines are paired in terms of the AUC, ACC and AUPR metric values, respectively. In the paired t-test, we set the significance level to 0.05. The null hypothesis ($H_0$) is that the performance of MGCNSS is not significantly better than that of the baseline models on the given evaluation metrics, while the alternative hypothesis ($H_1$) is that MGCNSS is significantly better than the baseline models. If the P-value is less than 0.05, we reject $H_0$ and accept $H_1$. The corresponding results are displayed in Table 4, which indicates that the performance of MGCNSS is superior to all the baseline approaches.
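As an illustration of the test described above, the sketch below computes the paired t statistic from per-run metric values. The scores are made-up numbers, not results from this study, and the function name `paired_t_statistic` is hypothetical. The P-value would then be read from the Student t distribution with n - 1 degrees of freedom, for instance with a statistics library such as SciPy (`scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_t_statistic(model_scores, baseline_scores):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Made-up AUC values over five paired runs (illustration only).
model_auc = [0.98, 0.99, 0.97, 0.98, 0.99]
baseline_auc = [0.95, 0.96, 0.95, 0.94, 0.96]

t_stat, dof = paired_t_statistic(model_auc, baseline_auc)
```

A large positive statistic means the paired differences are consistently in favour of the first model, which is the situation in which $H_0$ is rejected.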

Visualization for the embeddings of miRNA-disease pairs learned by MGCNSS
To better demonstrate the effectiveness of MGCNSS, we display the embedding learning process of miRNA-disease pairs. Similar to the SCSMDA method [60], the positive pairs (red points) and negative pairs (blue points) are pre-selected and their embeddings are visualized with the t-SNE tool [62] under different epochs. The results are shown in Figure 12.
It can be seen that the positive and negative pairs are mixed without a clear boundary when the epoch is 0. As the epoch number increases, the boundary between the embeddings of positive and negative pairs gradually becomes clear. When the epoch number reaches 2000, the positive and negative pairs are almost separated by one distinct boundary. Meanwhile, it is worth noting that even at epoch 2000 there is still some overlap between positive and negative pairs, indicating that the prediction of miRNA-disease associations remains challenging. Overall, the results confirm the ability of MGCNSS to learn discriminative embeddings of miRNAs and diseases.

Table 4 (P-values of the paired t-test):
[...] 2.1E-10 6.7E-09 9.0E-11
MGCNSS vs DTIGAT [57] 1.0E-09 6.0E-08 4.8E-10
MGCNSS vs DTICNN [58] 5.7E-10 1.3E-08 2.4E-10
MGCNSS vs NeoDTI [59] 5.3E-11 5.0E-09 5.2E-11
MGCNSS vs MSGCL [3] 6.2E-11 3.6E-09 5.7E-10
MGCNSS vs AMHMDA [35] 1.3E-09 7.8E-09 6.0E-10
MGCNSS vs GATMDA [13] 1.3E-09 3.1E-08 1.1E-09

Besides, to demonstrate the positions of positive and negative samples after executing our negative sample selection strategy, we visualize their 2D projections in Figure 13. Without the negative sample selection strategy, MGCNSS divides the training samples into two categories, which are the positive samples (red points) and negative samples (blue points) in Figure 13(A). After executing the proposed distance-based negative sample selection strategy, MGCNSS divides the training samples into three categories (see Figure 13B), which are positive samples (red points), likely negative samples (green points) and likely positive samples (blue points).
Specifically, the local area A in Figure 13(A) contains both positive pairs and negative pairs. Without a negative sample selection strategy, all the negative samples in area A could be candidate negative samples. However, since area A gathers many ground-truth positive samples, the negative samples that lie close to them may not be ground-truth negatives. In other words, they may be likely positive samples and should not be treated as candidate negative samples for training and testing. Correspondingly, in Figure 13(B), the negative sample selection strategy marks the negative samples in local area A as likely positive samples (blue points) rather than negative samples for training and testing. In this way, MGCNSS selects the negative samples from the likely negative samples instead of the likely positive samples. As a result, MGCNSS establishes a high-quality negative sample set, which improves its performance. The extensive results in the former section further verify the effectiveness of the negative sample selection strategy.
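The selection procedure summarized in Figure 3 can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes that CS denotes cosine similarity, that the positive centroid is recomputed from the known positives together with the likely positive pairs, and that iteration stops once the split no longer changes. The function names and the toy 2D embeddings are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_likely_negatives(positives, unlabeled, max_iter=50):
    """Return indices of unlabeled samples kept as likely negatives (LN)."""
    c_p, c_u = mean_vector(positives), mean_vector(unlabeled)
    is_lp = None
    for _ in range(max_iter):
        # A sample closer to the positive centroid is a likely positive (LP).
        new_is_lp = [cosine(x, c_p) >= cosine(x, c_u) for x in unlabeled]
        if new_is_lp == is_lp:
            break  # split is stable, so the centroids have converged
        is_lp = new_is_lp
        lp = [x for x, flag in zip(unlabeled, is_lp) if flag]
        ln = [x for x, flag in zip(unlabeled, is_lp) if not flag]
        if not ln:
            break
        c_p = mean_vector(positives + lp)  # assumed update rule
        c_u = mean_vector(ln)
    return [i for i, flag in enumerate(is_lp or []) if not flag]


# Toy pair embeddings (made-up values for illustration).
positives = [[1.0, 0.1], [0.9, 0.2]]
unlabeled = [[1.0, 0.15], [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
likely_negative_idx = select_likely_negatives(positives, unlabeled)
```

In this toy example the first unlabeled pair lies in the positive cluster and is excluded, so negatives would be drawn only from the remaining three pairs.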

Case study
We conduct the case study to further evaluate the performance of MGCNSS, following our previous research [29]. Firstly, we train MGCNSS with all the miRNA-disease associations in the HMDDv2.0 dataset. Then, we predict the potential associated miRNAs for the selected diseases. After that, we screen out the top 50 predicted miRNAs based on their corresponding scores. To validate these predicted associations, we search for evidence in the HMDDv4.0 [63] and dbDEMC [64] databases. Currently, HMDDv4.0 contains 53 553 miRNA-disease association entries, which include 1817 human miRNA genes, 79 virus-derived miRNAs and 2379 diseases from 37 022 papers. Besides, dbDEMC is designed to store and display differentially expressed miRNAs in cancers detected by high- and low-throughput strategies. Its current version contains 2584 miRNAs and 40 cancer types for humans.
Specifically, colon neoplasms and esophageal neoplasms are selected for validation. Colon neoplasm is a common malignant tumor that occurs in the colon of the digestive tract [65]. It ranks third in the incidence of gastrointestinal tumors and is increasing year by year, causing more than one million cases and 500 000 deaths annually. We first investigate the miRNAs associated with colon neoplasms, and the results are shown in Table 5. The results demonstrate that the top 50 predicted miRNAs have all been confirmed by HMDD or dbDEMC. As another high-incidence disease, esophageal neoplasm is a malignant tumor occurring in the esophageal tissue. Its morbidity and mortality are high, ranking 8th and 6th among all types of cancer, respectively [66]. Early diagnosis is beneficial to improving the survival rate of patients. The corresponding results (Table 6) show that 49 of the top 50 can be verified by HMDD or dbDEMC. The results of the case study illustrate the ability of MGCNSS to detect novel associations between miRNAs and diseases.
Besides, to verify the ability of MGCNSS to find novel miRNA-disease associations, we select colon neoplasms and predict its top-10 associated miRNAs. Meanwhile, we also select some of the baseline approaches in Table 1, predict their top-10 associated miRNAs for colon neoplasms and rank them according to their scores. The corresponding results are displayed in Table 7. The results demonstrate that all the top-10 miRNAs predicted by MGCNSS can be confirmed by the HMDDv4.0 [63] and dbDEMC [64] databases. In particular, hsa-Let-7a is predicted by MGCNSS (Rank 10), while the remaining approaches do not place this miRNA among their top-10 predictions for colon neoplasms. The miRNA hsa-Let-7a has been confirmed by PMID-31434447 in the HMDD database and SourceID-GSE2564 in dbDEMC. The results of this experiment illustrate the stable prediction ability of the proposed model.
Moreover, to comprehensively compare the ability of MGCNSS and the baselines to find novel associations, we conduct a further experiment. Specifically, we choose colon neoplasms as the target disease and collect the associated miRNAs predicted by each comparison model together with their predicted scores. All the miRNAs predicted by a comparison model form its predicted miRNA set. Without loss of generality, we name the predicted miRNA set of MGCNSS as A, and the predicted miRNA set of each comparison approach as B. Then, we compare MGCNSS with each baseline one by one under three metrics of A and B, which are |A∩B|/|A∪B|, |A|/|A∪B| and |B|/|A∪B|, respectively. The results are displayed in Table 8. Specifically, the
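The screening and validation step can be illustrated with the short sketch below: candidate miRNAs are ranked by predicted score, the top k are kept and each is checked against a set of database-confirmed miRNAs. The function name `top_k_validation`, the miRNA identifiers and the scores are hypothetical placeholders, not values from Tables 5 or 6.

```python
def top_k_validation(scores, confirmed, k):
    """Rank candidates by score and count how many of the top k are confirmed."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for mirna in ranked if mirna in confirmed)
    return ranked, hits


# Placeholder predictions for one disease (made-up identifiers and scores).
scores = {"miR-a": 0.9, "miR-b": 0.8, "miR-c": 0.4, "miR-d": 0.7}
confirmed = {"miR-a", "miR-d"}  # stands in for HMDD / dbDEMC evidence

ranked, hits = top_k_validation(scores, confirmed, k=3)
```

In the case study, k is 50 and the confirmed set is the union of the evidence found in the two databases.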

DISCUSSIONS
MGCNSS achieves the best performance among all the 11 miRNA-disease association prediction approaches. Extensive results demonstrate its effectiveness and stability under different conditions. In this section, we analyze its advantages and drawbacks as follows.

The meta-path based multi-layer graph convolution
In the miRNA-disease association network, there are two types of nodes and two types of edges. The meta-paths with different lengths carry rich semantic meaning between nodes in this heterogeneous network. For example, the meta-path miRNA1-miRNA2-disease1 denotes that if miRNA1 and miRNA2 have a high similarity value, and miRNA2 and disease1 have an association relationship, then miRNA1 is likely to be associated with disease1 as well. This assumption has been widely accepted and utilized [8]. MGCNSS adopts multi-layer graph convolution to capture the meta-paths with different lengths. It can thus comprehensively learn the embeddings of miRNAs and diseases, which improves the prediction performance. The results of the ablation experiments demonstrate that the meta-path-based multi-layer graph convolution is essential for MGCNSS (see Figures 5 and 6). Besides, the performance of MGCNSS based on meta-paths with different lengths (see Table 3) is also investigated. The results illustrate that meta-paths with lengths 1 and 2 give the best performance. The meta-path with length 3 may introduce noise, which lowers the metrics of MGCNSS in the experiment. In future work, we would like to analyze the effect of network quality on the meta-path-based embedding learning.
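The miRNA1-miRNA2-disease1 example can be made concrete with a small sketch. It assumes, in line with the toy example of Figure 2, that the weight of a path is the product of its edge weights and that parallel paths are summed, so that the weights of all length-2 meta-paths are given by the square of the weighted adjacency matrix. The edge weights are made up, and the sketch ignores the normalization, learnable parameters and non-linearities of an actual graph convolution layer.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Node order: miRNA1, miRNA2, disease1 (made-up weights).
# miRNA1-miRNA2 similarity is 0.8; miRNA2-disease1 is a known association (1.0).
adjacency = [
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Entry [i][j] sums the weights of all length-2 paths from node i to node j.
length2 = matmul(adjacency, adjacency)
mirna1_to_disease1 = length2[0][2]
```

Although miRNA1 has no direct edge to disease1, the length-2 entry is non-zero, which is the kind of indirect evidence that a second propagation step can pass on.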

Negative sample selection strategy
According to the ablation study, the negative sample selection strategy has a great impact on the performance of MGCNSS: the results show that the proposed strategy is of great help in improving the performance. To visually demonstrate the role of negative sample selection, we depict the embeddings in the sub-figures of Figure 12, which indicate that the boundary between the positive and negative pairs gradually becomes clear as the epoch number increases. Besides, Figure 13 shows that MGCNSS avoids selecting negative samples that fall in the area where positive samples gather (see Figure 13B). Our intuition is that samples in this area should be likely positive samples, not likely negative samples. In this way, MGCNSS selects higher-quality negative samples and obtains a better prediction result.

The result of case study
It is crucial for each prediction model to discover novel association relationships between miRNAs and diseases. To verify this ability of MGCNSS, we conduct the case study. In the first group of experiments, MGCNSS is employed to infer miRNAs for colon neoplasms and esophageal neoplasms, respectively. The results demonstrate that the proposed approach has a creditable performance. All the top-50 predicted miRNAs for colon neoplasms and 49 of the top 50 for esophageal neoplasms can be verified by HMDD or dbDEMC.

CONCLUSION
In this study, we proposed MGCNSS for miRNA-disease association prediction based on multi-layer graph convolution and a negative sample selection strategy. Specifically, MGCNSS employs multi-layer graph convolution to automatically capture the meta-path relations with different lengths in the heterogeneous network and learn the discriminative representations of miRNAs and diseases. Besides, MGCNSS establishes a high-quality negative sample set by choosing the likely negative samples from the unlabeled sample set with the distance-based sample selection strategy. The extensive results demonstrate that MGCNSS outperforms all baseline methods on the experimental dataset under different scenarios. The results of the case study further demonstrate the effectiveness of MGCNSS in miRNA-disease association prediction.
We will pursue future work in the following three directions. Firstly, other biological entity association information, such as miRNA-lncRNA associations, could be employed for measuring the similarities of miRNAs and diseases from more comprehensive perspectives. In this way, a high-quality miRNA-disease heterogeneous network could be established, enabling the learning of more discriminative embeddings of miRNAs and diseases. Secondly, we could construct a miRNA-disease-related biological knowledge graph and predict the underlying miRNA-disease associations with knowledge graph embedding techniques. Thirdly, since association prediction between different entities is one of the foundational tasks in bioinformatics, we would like to apply our proposed model to other link prediction problems, such as disease-gene association and microbe-drug association prediction.

Key Points
• MGCNSS could automatically capture the rich semantics of meta-paths with different lengths between miRNAs and diseases for learning their embeddings.
• A negative sample selection strategy is proposed to screen out high-quality negative samples to enhance the performance of the prediction model.
• The results demonstrate that MGCNSS outperforms all baseline methods on the evaluation metrics.

Figure 2. A toy example for learning the importance of meta-paths. (A) There are two node types and two edge types. (B) The miRNA-disease heterogeneous network, where the numbers on yellow edges and blue edges represent the similarities between the different nodes and the association relationships, respectively. (C) For the meta-paths with length equal to 2, we display all the paths from node $m_2$ to $d_1$. For each path, we first multiply the weights of its two edges to measure the total weight of the corresponding path. Then, MGCNSS could construct the weight matrix $M_c^2$ in terms of the meta-paths with length 2. (D) The integrated result from $m_2$ to $d_1$ is shown in the weight matrix $M_c^2$.

Figure 3. The process of the negative sample selection strategy. First, MGCNSS generates centroid vectors $C_p$ and $C_u$ from positive samples and the remaining unlabeled samples, respectively. Then, MGCNSS calculates the CS between each sample in the unlabeled sample set and the centroid vectors $C_p$ and $C_u$, respectively. Based on the CS, we could divide the unlabeled samples into two groups, Likely Positive Pairs (LP) and Likely Negative Pairs (LN). Next, we update the two centroid vectors using LN and LP. Moreover, MGCNSS adopts ES to repeat these steps until the centroid vectors $C_p$ and $C_u$ converge. Finally, we regard LN as the reliable negative sample set.

Figure 4. The 5-fold cross-validation strategy used in this study.

Figure 8. The parameter sensitivity analysis with different learning rates, embedding sizes and numbers of convolution layers.

Figure 9. The change of AUC, ACC and AUPR values accompanied by the BCE loss under different epochs.

Figure 10. The change of different hyperparameters under different epochs.

Figure 11. The results of MGCNSS under different negative sample selection strategies.

Figure 12. Visualization for the embeddings of miRNA-disease pairs learned by MGCNSS under different epochs.

Figure 13. Comparison results of MGCNSS without and with negative sample screening strategy.

Table 7:
Top 10 colon neoplasms-related miRNAs predicted by MGCNSS and other baseline approaches

Table 2:
The evaluation results of MGCNSS and baseline methods on 1:5 and 1:10 ratios
Note: Since MGCNSS w/o NSST is a variant of MGCNSS, we only compare it with MGCNSS. The best results are marked in bold and the second best are underlined.

MI, MGCNSS w/o MP and MGCNSS w/o SS by 3.22, 0.70 and 5.76% on the AUPR metric, respectively. MGCNSS achieves the best performance on all three metrics. The corresponding results are shown in Figure

Table 3:
The performance of MGCNSS based on meta-paths with different lengths

Table 4:
The statistical significance analysis for MGCNSS and baseline approaches

Table 6:
Top 50 esophageal neoplasms-related miRNAs predicted by MGCNSS

values in column |A|/|A∪B| are always larger than those in column |B|/|A∪B|. Taking MGCNSS and GATMDA for example, the value for |A|/|A∪B| is 0.9164, while the value for |B|/|A∪B| is 0.6104. From the results, we can find that MGCNSS outperforms other baselines in finding novel miRNA-disease associations.
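The three set-based ratios used in this comparison can be computed as in the following minimal sketch. The function name `overlap_metrics` and the example sets are hypothetical and are not taken from Table 8.

```python
def overlap_metrics(set_a, set_b):
    """Return |A∩B|/|A∪B|, |A|/|A∪B| and |B|/|A∪B| for two predicted sets."""
    union_size = len(set_a | set_b)
    if union_size == 0:
        return 0.0, 0.0, 0.0
    return (
        len(set_a & set_b) / union_size,
        len(set_a) / union_size,
        len(set_b) / union_size,
    )


# Placeholder predicted miRNA sets (made-up identifiers).
set_a = {"miR-a", "miR-b", "miR-c", "miR-d"}  # stands in for MGCNSS
set_b = {"miR-c", "miR-d", "miR-e"}           # stands in for one baseline

shared, share_a, share_b = overlap_metrics(set_a, set_b)
```

A larger value of |A|/|A∪B| than |B|/|A∪B| means that set A covers more of the miRNAs predicted by either model.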